Before comparing the two maps, note how CalEnviroScreen measures each indicator. Asthma is reported as the number of asthma-related emergency department visits per 10,000 people, averaged over 2015-2017. This matters because many people suffer from asthma, but not everybody goes to the emergency room for it, so the indicator misses people who manage their asthma without an emergency visit. PM2.5, on the other hand, is a weighted average of readings from the available air quality monitors in the area over 2015-2017, in µg/m³. Because the two indicators are measured so differently, the maps are difficult to compare directly, and asthma rates should not be expected to track PM2.5 concentrations closely. The linear and log regressions below examine this further.

The apparent “best fit line” shows a slight positive relationship between PM2.5 and asthma cases, but its R-squared value looks pretty low.

## 
## Call:
## lm(formula = Asthma ~ PM2.5, data = ces4_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.364 -25.918  -9.627  12.580 182.975 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -114.285     13.082  -8.736   <2e-16 ***
## PM2.5         19.635      1.539  12.759   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.47 on 1561 degrees of freedom
## Multiple R-squared:  0.09444,    Adjusted R-squared:  0.09386 
## F-statistic: 162.8 on 1 and 1561 DF,  p-value: < 2.2e-16

The summary of the model shows that the relationship is statistically significant (p < 2e-16) but weak. An increase of 1 µg/m³ in PM2.5 is associated with an increase of about 19.6 asthma cases per 10,000 people. The variation in PM2.5 explains only 9.4% of the variation in asthma cases. That’s not very high!
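
The slope, R-squared and p-value printed in a summary like the one above can also be pulled out of the fitted model object directly. The following is a minimal sketch on simulated data, not the original analysis: the data frame `sim` and its values are assumptions, with columns named to mirror `ces4_clean`.

```r
# Simulated stand-in for ces4_clean; the values are made up for illustration.
set.seed(1)
n <- 500
sim <- data.frame(PM2.5 = runif(n, 6, 12))
sim$Asthma <- pmax(5 + 8 * sim$PM2.5 + rnorm(n, sd = 35), 1)

# Same model form as in the report: Asthma regressed on PM2.5.
fit <- lm(Asthma ~ PM2.5, data = sim)
s <- summary(fit)

slope     <- coef(fit)[["PM2.5"]]                    # change in Asthma per 1 unit of PM2.5
r_squared <- s$r.squared                             # share of variance explained
p_value   <- s$coefficients["PM2.5", "Pr(>|t|)"]     # significance of the slope

cat(sprintf("slope = %.2f, R^2 = %.3f, p = %.3g\n", slope, r_squared, p_value))
```

Reading the three numbers side by side makes the distinction explicit: a very small p-value says the slope is unlikely to be zero, while a small R-squared says PM2.5 alone predicts asthma poorly.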

The residual distribution has a high density of residuals below 0: the median residual is -9.6, while the residuals range from -54.4 to 183.0. This means the distribution is asymmetrical, with a long tail of large positive residuals. Since the R-squared value is low, the residual errors are high (the model reports a residual standard error of 37.47 on 1561 degrees of freedom).

## 
## Call:
## lm(formula = log(Asthma) ~ PM2.5, data = ces4_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.00808 -0.46733  0.03068  0.41932  1.75109 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7146     0.2278   3.136  0.00174 ** 
## PM2.5         0.3541     0.0268  13.211  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6526 on 1561 degrees of freedom
## Multiple R-squared:  0.1006, Adjusted R-squared:  0.09998 
## F-statistic: 174.5 on 1 and 1561 DF,  p-value: < 2.2e-16

## [1] -54.36409
## Simple feature collection with 1 feature and 35 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.266 ymin: 37.86426 xmax: -122.254 ymax: 37.86932
## Geodetic CRS:  NAD83
## # A tibble: 1 × 36
##   `Census Tract` `Total Population` `California County`   ZIP `Approximate Loca…
##            <dbl>              <dbl> <chr>               <dbl> <chr>             
## 1     6001422800               9064 Alameda             94704 Berkeley          
## # … with 31 more variables: Longitude <dbl>, Latitude <dbl>,
## #   CES 4.0 Score <dbl>, CES 4.0 Percentile <dbl>,
## #   CES 4.0 Percentile Range <chr>, Ozone <dbl>, PM2.5 <dbl>, Diesel PM <dbl>,
## #   Drinking Water <dbl>, Lead <chr>, Pesticides <dbl>, Tox. Release <dbl>,
## #   Traffic <dbl>, Cleanup Sites <dbl>, Groundwater Threats <dbl>,
## #   Haz. Waste <dbl>, Imp. Water Bodies <dbl>, Solid Waste <dbl>,
## #   Pollution Burden <dbl>, Pollution Burden Score <dbl>, Asthma <dbl>, …
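
The output above appears to be the most negative residual of the linear model (it matches the `Min` value in that model's summary) followed by the census tract it belongs to. A lookup of that kind could be sketched as follows. This is an illustration on simulated data, not the code that produced the output: the data frame `sim`, the `tract` labels and all values are hypothetical.

```r
# Simulated stand-in for ces4_clean; tract labels and values are invented.
set.seed(42)
n <- 200
sim <- data.frame(
  tract = sprintf("T%03d", seq_len(n)),   # hypothetical tract labels
  PM2.5 = runif(n, 6, 12)
)
# Right-skewed outcome, so residuals from a straight-line fit are lopsided.
sim$Asthma <- exp(1 + 0.3 * sim$PM2.5 + rnorm(n, sd = 0.6))

fit <- lm(Asthma ~ PM2.5, data = sim)
sim$resid <- residuals(fit)               # observed minus predicted

min(sim$resid)                            # most negative residual
worst <- sim[which.min(sim$resid), ]      # row where the model overpredicts most
print(worst)
```

Because a residual is observed minus predicted, the row returned by `which.min()` is the place where the model's prediction exceeds the observed value by the largest amount.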

After log-transforming asthma, the residuals are more densely centered on 0 (the median residual is 0.03), although the R-squared improves only slightly, from 0.094 to 0.101. This means the data better fit “a change of 1 in PM2.5 leads to a percent change in asthma cases” than a linear relationship of “increasing PM2.5 linearly increases the number of asthma cases”. Plotting the residuals on the map shows where large errors remain. A negative residual means that the predicted value is too high, which is the case for a lot of these places: the model overestimates asthma cases there.
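
To make the “percent change” reading concrete, the coefficients from the log-model summary can be worked through by hand. This is a sketch of the arithmetic only; here $x$ stands for the PM2.5 concentration in µg/m³, and the back-transformed value is best read as a typical (geometric-mean) rate rather than an exact average.

$$
\log(\widehat{\text{Asthma}}) = 0.7146 + 0.3541\,x
$$

$$
\frac{\widehat{\text{Asthma}}(x+1)}{\widehat{\text{Asthma}}(x)} = e^{0.3541} \approx 1.425
$$

Under this model, each additional 1 µg/m³ of PM2.5 is associated with roughly a 42% higher asthma rate, whatever the starting concentration. The linear model instead adds a fixed 19.6 cases per 10,000 for each additional 1 µg/m³.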